Machine Learning to Identify Patients at Risk of Developing New-Onset Atrial Fibrillation after Coronary Artery Bypass

Background: This study aims to develop an effective machine learning (ML) prediction model of new-onset postoperative atrial fibrillation (POAF) following coronary artery bypass grafting (CABG) and to highlight the most relevant clinical factors. Methods: Four ML algorithms were employed to analyze 394 patients undergoing CABG, and their performances were compared: Multivariate Adaptive Regression Spline, Neural Network, Random Forest, and Support Vector Machine. Each algorithm was applied to the training data set to choose the most important features and to build a predictive model. Each model was optimized through a hyperparameter search, and the Receiver Operating Characteristic Area Under the Curve metric was used to choose the best model. The best instance of each model was then applied to the test data set, and several metrics were generated to assess performance on these unseen data. A traditional logistic regression was also performed for comparison with the machine learning models. Results: The Random Forest model showed the best performance, and its top five predictive features were age, preoperative creatinine values, time of aortic cross-clamping, body surface area, and Logistic Euro-Score. Conclusions: The use of ML for clinical predictions requires an accurate evaluation of the models and their hyperparameters. Random Forest outperformed all other models in the clinical prediction of POAF following CABG.


Introduction
Atrial fibrillation is the most common supraventricular arrhythmia [1,2], and its incidence is dramatically rising worldwide [3]. The number of people with atrial fibrillation (AF) in Europe is expected to double to >17 million by 2060 due to aging populations [4]. Therefore, AF can be considered a 21st-century "cardiovascular disease global epidemic" due to its dramatic medical, social, and economic burden [5][6][7][8][9].
The high incidence of AF recurrence after the treatment imposes a substantial extra burden on the healthcare system due to increased morbidity, mortality, associated therapeutic interventions, and other costs such as patient visits, anticoagulation status, and side effects from drug therapy [14]. Therefore, there is a general agreement amongst experts that there is a pressing need to improve AF treatment [15].
Risk factors for developing POAF after CABG have been identified in several works [25], including general parameters of patients' functional status (older age, low ejection fraction, comorbidities such as chronic obstructive pulmonary disease and chronic renal dysfunction) as well as more specific parameters such as preoperative withdrawal of beta-blocker drugs. However, most of the available literature has employed only classical statistical methods, which assume linear relationships between variables and predicted outcomes and require interactions between variables to be specified a priori [26].
Recently, machine learning (ML) algorithms have been applied in various fields of healthcare [27]. They have the advantage of identifying non-linear associations between covariates and of detecting interactions between variables directly from observed data [28]. Nonetheless, ML algorithms have been used to predict the risk of developing POAF in only a few papers and in very small cohorts [29].
This study aims to develop an effective prediction model of POAF following CABG operations and to highlight the most relevant patient and clinical features involved through ML algorithms.

Data Source, Patients Selection, and Definitions
This retrospective study includes three hundred ninety-four patients undergoing CABG at the Cardiothoracic Department (CTC) of Maastricht University Medical Center+ (MUMC+) between 2010 and 2017.
The study included patients above 18 years old undergoing first-time CABG. Patients who had previous cardiac surgery were excluded, as well as those with documented AF or who received anti-coagulant therapy within six months before CABG. No other exclusion criteria were applied.
POAF was defined as an acute or new-onset episode with irregular RR-intervals in an electrocardiogram (ECG) without a traceable p-wave for at least 10 s [30][31][32], occurring during the postoperative in-hospital stay.
After excluding underlying medical comorbidities like electrolyte imbalance, amiodarone was started (2.5-5 mg/kg IV over 20 min, then 15 mg/kg). Electrical cardioversion was employed in case of a failed pharmacological attempt, POAF lasting over 48 h, or hemodynamic instability.

Variables and Preliminary Analysis
Variables included demographic characteristics, laboratory data related to renal function, surgical parameters, and postoperative complications. The Logistic Euro-Score was employed, which is largely used in cardiac surgery patients for individual risk prediction, including the very high-risk patient.
A preliminary analysis was performed to identify zero- and near-zero-variance features (i.e., variables with a single value or a handful of unique values that occurred with very low frequencies), which may cause a model to fail or the fit to be unstable [33]. These variables were merged into a single variable to bypass this issue.

Data Pre-Processing
Before running the ML procedures, some pre-processing steps were carried out. The dataset was split in two at a 75:25 ratio: a training dataset (296 patients) was used to feed the models, and a held-out test dataset (98 patients) was used to assess their performance. The splitting process kept the postoperative AF/non-AF ratio consistent between the datasets.
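A split that preserves the outcome ratio can be sketched as follows. This is a minimal Python illustration and not the R/caret code used in the study; the function name `stratified_split`, the seed, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) keeping the class ratio similar in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Split each class separately so both parts keep its proportion.
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical, not study data): 1 = POAF, 0 = non-POAF.
labels = [1] * 40 + [0] * 360
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Because each class is shuffled and cut on its own, the event rate in the two parts matches the rate in the full toy data set, which is the property the splitting step relies on.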
Numeric variables were centered (by subtracting the mean), scaled (by dividing by the standard deviation), and normalized to the range 0 to 1. Categorical variables were one-hot encoded (a separate indicator variable was created for each class). Missing preoperative creatinine values (the only variable with missing values, 10.1%) were imputed with the k-nearest neighbors method.
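The pre-processing steps described above can be illustrated with a short Python sketch. It is not the study's R code: the helper names, the choice of k, the Euclidean distance, and the toy values are all assumptions, and the exact order and settings used in the original analysis are not specified in the text.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Center by the mean and scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def min_max(values):
    """Rescale values linearly to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Create one 0/1 indicator column per distinct class."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values]


def knn_impute(rows, target, k=2):
    """Fill missing (None) entries of column `target` with the mean of the
    k nearest complete rows, using Euclidean distance on the other columns."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            def dist(other):
                return math.sqrt(sum((a - b) ** 2
                                     for j, (a, b) in enumerate(zip(row, other))
                                     if j != target))
            nearest = sorted(complete, key=dist)[:k]
            new_row[target] = mean(r[target] for r in nearest)
        filled.append(new_row)
    return filled


# Toy values (hypothetical, not study data).
z = standardize([50, 60, 70])
scaled = min_max([50, 60, 70])
encoded = one_hot(["A", "B", "A"])
imputed = knn_impute([[0.0, 1.0], [0.1, 1.2], [0.9, 3.0], [0.05, None]], target=1)
```

In the toy example, the row with the missing value is closest to the first two rows, so it receives the mean of their values in the target column.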

Machine Learning Algorithms
Four ML algorithms were employed, and their performances were compared: Multivariate Adaptive Regression Spline (MARS), Neural Network (NN) with three hidden layers, Random Forest (RF), and Support Vector Machine (SVM) with a radial basis kernel function. After the built-in feature selection, each algorithm was applied to the training dataset to build the predictive model. The following functions were called to create the models: "earth" with MARS, "mlpML" with NN, "rf" with RF, and "svmRadial" with SVM.
A hyperparameter search was adopted to optimize each model, and the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) metric was selected to choose the best model. The search consisted of two phases: a preliminary step to detect a plausible set of values and a grid search step to fine-tune the editable hyperparameters of the functions used.
The hyperparameters taken into account by the individual models were as follows: with MARS, nprune (maximum number of terms, including the intercept, in the pruned model) and degree (maximum degree of interaction); with NN, the size of the three hidden layers; with RF, mtry (number of variables randomly sampled as candidates at each split); with SVM, sigma (inverse kernel width) and cost (cost regularization parameter, which controls the smoothness of the fitted function; higher values lead to less smooth functions).
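The selection rule described above (pick the hyperparameter setting with the best resampled ROC AUC) can be sketched in Python as follows. This is an illustration only and not the caret implementation: `roc_auc` uses the rank-based definition of the AUC, `select_best` is a hypothetical helper, and the per-fold values in `grid_results` are invented for the example.

```python
from itertools import product
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive case receives a
    higher score than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def select_best(grid_results):
    """Pick the hyperparameter setting with the highest mean resampled AUC."""
    return max(grid_results, key=lambda params: mean(grid_results[params]))


# Hypothetical per-fold AUC values for three (sigma, cost) candidates.
grid_results = {
    (0.001, 1): [0.70, 0.74, 0.72],
    (0.002, 2): [0.78, 0.80, 0.76],
    (0.010, 4): [0.71, 0.69, 0.73],
}
best = select_best(grid_results)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In a real search, each list of per-fold values would come from fitting the model on the training folds and scoring the held-out fold with `roc_auc`.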
A 10-fold cross-validation method was used during the training step as the resampling method to validate each model. Furthermore, the importance of the variables was estimated. An up-sampling technique (randomly replicating instances of the minority class, with replacement) was employed during the training step to address the class imbalance in the training dataset. Moreover, the differences between models in the resampling results for ROC AUC, sensitivity, and specificity were estimated. Finally, the best instance of each model (according to the chosen metric, ROC AUC) was applied to the held-out test dataset (the learning steps were never applied to these data). For each prediction, a confusion matrix was generated. The following values were calculated: accuracy (true positive and true negative cases divided by all cases), sensitivity or recall (true positive cases divided by positive reference events), specificity (true negative cases divided by negative reference events), precision or positive predictive value (true positive cases divided by predicted positive events), negative predictive value (true negative cases divided by predicted negative events), and F1 value (harmonic mean of precision and recall).
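The metric definitions listed above translate directly into code. The following Python sketch is illustrative only; the function name `confusion_metrics` and the toy labels are assumptions, and the sketch does not guard against empty denominators (for example, when no case is predicted positive).

```python
def confusion_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Toy example (hypothetical, not study data): 3 events among 10 cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = confusion_metrics(y_true, y_pred)
```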
Lastly, the ROC and Precision-Recall (PR) curves of all tested models were computed. The PR curve is typically employed for assessing model performance when the outcome class is very unbalanced (for example, 1% vs. 99%). We adopted the PR curve even though the outcome class of our study is not as severely unbalanced (10.2% POAF vs. 89.8% non-POAF).
Finally, as age is a well-known independent factor for AF initiation, the machine learning algorithms were trained only on the age feature to predict POAF events. The predictive power (ROC AUC) was compared with the models trained on the other features.
The analysis was carried out using R (R Core Team (2021). R: A language and environment for statistical computing, version 4.1.2. R Foundation for Statistical Computing, Vienna, Austria) and the caret package, version 6.0-86 [33].

Traditional Logistic Regression
For comparison purposes only, a logistic regression model was also fitted on the same training data, and its accuracy was then assessed on the test data.

Statistical Significance
A p-value < 0.05 was considered statistically significant.

Pre-Processing
After splitting the initial data set in two, the training data set had 296 cases and the test data set had 98 cases. Both data sets had the same POAF/non-POAF patient ratio as the initial data set. In the supplementary material, Tables S1 and S2 report the descriptive statistics of the training and test data sets, respectively.
Figure 1A-D shows the iterations of the hyperparameter search performed with the MARS, NN, RF, and SVM models after the last run. The graphs report the variation of the ROC AUC as a function of one or more parameters of each model. The maximum value of the chosen metric (ROC AUC) was identified for each model, and the corresponding best model was saved as follows: with MARS, nprune = 5 and degree = 4; with NN, layer1 = 7, layer2 = 9, layer3 = 6; with RF, mtry = 1; with SVM, sigma = 0.002 and cost = 2.
Table 2 shows the maximum resampling values of ROC AUC, sensitivity, and specificity obtained for each model. The maximum ROC AUC value (0.95) was obtained by using the SVM model, the maximum sensitivity value (1) by using the MARS, NN, and SVM models, and the maximum specificity value (1) by using the NN model.
Table 3 shows the estimated differences between the metric values reported in Table 2 and the p-values (Bonferroni adjustment). ROC AUC does not show any significant difference. Sensitivity, however, shows a statistically significant difference (0.50) between NN and RF (p-value, 0.03), and specificity shows a statistically significant difference (−0.19) between MARS and RF (p-value, 0.05). Figure 2 additionally shows the confidence intervals of the statistically significant differences only.
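The Bonferroni adjustment mentioned above can be illustrated with a short Python sketch. It assumes the correction is applied across the six pairwise comparisons among four models for a single metric; the function name `bonferroni` and the raw p-values are hypothetical and are not taken from Table 3.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


models = ["MARS", "NN", "RF", "SVM"]
pairs = list(combinations(models, 2))  # all pairwise model comparisons

# Hypothetical raw p-values, one per pair (not study results).
raw = [0.005, 0.2, 0.5, 0.01, 0.04, 0.9]
adjusted = dict(zip(pairs, bonferroni(raw)))
```

With six comparisons, a raw p-value must fall below roughly 0.0083 for the adjusted value to stay under the 0.05 significance level.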

Figure 2. Statistically significant estimated differences of the metric values with confidence intervals; a vertical line indicates zero difference. The statistically significant differences are in sensitivity (between the NN and RF models) and in specificity (between the MARS and RF models).

Features Selected by the Models
Relevant features selected by the MARS, NN, RF, and SVM algorithms are displayed in Figure 3A, 3B, 3C, and 3D, respectively. The measures of importance are scaled to have a maximum value of 100. With the MARS model, the top-ranked feature was age (100.0%). Employing the RF model, the top five features were age (100.0%), preoperative creatinine values (86.1%), time of aortic cross-clamping (82.2%), body surface area (80.9%), and Logistic Euro-Score (80.7%).
Finally, with the SVM model, the top five features were age (100.0%), the number of distal anastomoses (65.0%), use of double mammary artery (60.6%), the performance of T-graft anastomosis (60.5%), and Operation Year 2013 (50.4%).

Testing
After feeding the models with the test data set, the confusion matrices were calculated from the predicted values (at a 0.5 classification threshold) and the true outcome values. Table 4 shows the metrics derived from the confusion matrices: accuracy, sensitivity (or recall), specificity, precision (or positive predictive value), negative predictive value, and F1 value for each model. The RF model reaches the maximum values for all metrics except sensitivity, for which the maximum value (0.70) is associated with the MARS model.
To show which classification threshold corresponds to a given point of the curves, we implemented a color scale in Figure 4A-D for the ROC curves and in Figure 5A-D for the PR curves of the MARS, NN, RF, and SVM models, respectively. The color scale is a helpful tool to assess how sensitivity varies as a function of one minus specificity (ROC curve) and how precision varies as a function of recall (PR curve) for different classification threshold values. Typically, the metrics shown in Table 4 are calculated at the 0.50 classification threshold. The value of 0.50 represents the cut-off used during the prediction step to decide whether the patient outcome is assigned to one class or the other (in our study, POAF or non-POAF). However, 0.50 is not the only possible value, so the color scale allows us to check how the metrics change as the classification cut-off changes. For example, looking at Figure 4C (RF model), with a cut-off (or threshold) of about 0.25 (yellow color), the sensitivity is about 0.9, and the false positive rate is about 0.75. With a cut-off of about 0.7 (blue color), the sensitivity is about 0.2, and the false positive rate is about 0.04.
Ultimately, there is a trade-off between sensitivity and false positive rate, and both depend on the chosen cut-off value.
As shown in Table 1, age is significantly higher in the POAF group than in the non-POAF group. Moreover, Figure 3A-D shows that age is the most important feature (100%) selected by all four models. To verify that the other features were predictive of POAF beyond its strong age dependence, we also trained the models on the age feature alone and compared the prediction ROC AUC with that of the models trained on the other features.

Logistic Regression
The accuracy of the logistic regression model was 64% (95% CI (54%, 74%)), and the ROC AUC value was 0.64 on the test (held-out) data set.

Discussion
This paper aimed to provide an effective prediction model of POAF following CABG and highlight the most relevant patient and clinical features selected through ML approaches. The application of ML for clinical predictions requires an accurate evaluation of the models and their hyperparameters before choosing the suitable model targeted for the specific purpose.
One of the more notable findings of our work was that no single model reached the highest ROC AUC, sensitivity, and specificity together. Indeed, the NN had the highest ROC and sensitivity, while RF obtained the highest specificity. Given the unbalanced nature of the dataset, PR curves were analyzed together with ROC curves, and the RF model demonstrated better performance. The RF model showed that age (100.0%), preoperative creatinine values (86.1%), time of aortic cross-clamping (82.2%), body surface area (80.9%), Logistic Euro-Score (80.7%), and extracorporeal circulation time (65.7%) were the predictors with a normalized contribution to the model greater than 40%.
Analyzing the confusion matrices of the test data, the RF model reaches the highest values of the prediction performance metrics, except for sensitivity (0.60), which is similar to that of the other models (the MARS model has a slightly higher value). Furthermore, with the RF model, the sensitivity and specificity values of the prediction are lower than the corresponding maximum resampling values, suggesting that the RF model generalizes better than the other models, also considering the maximum resampling value of the ROC AUC.
The confusion matrices were calculated at the threshold value of 0.50, which is only one of the possible reference values. Therefore, the comparative evaluation of the models cannot be based on confusion matrices alone; the ROC and PR curves must also be evaluated to obtain an overall measure of classification quality.
The ROC AUC analysis performed with the test data showed that RF had the highest values. Nevertheless, the ROC curve alone does not fully represent the quality of the prediction. When data are unbalanced, the PR curve is recommended. As our data were moderately unbalanced, we used the PR curves to further confirm the findings from the ROC curves.
Observing the graph of the PR curve for the RF model, for threshold values from about 0.7 to 0.8, the precision is very high (equal to 1), but the recall is relatively low (between 0.0 and 0.2). As soon as the threshold value is reduced (<0.6, >0.4), the precision significantly lowers, remaining around 0.4. Nonetheless, in this setup, the recall value rises. Further decreases in the threshold value (<0.4) improve recall but worsen precision until the baseline value is reached for threshold values <0.2.
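The threshold behavior described above can be reproduced on toy data. The Python sketch below is illustrative only: the labels and scores are hypothetical (not the RF model's predictions), the helper name is an assumption, and precision is set to 1.0 by convention when no case is predicted positive.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when cases with score >= threshold are called positive."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention for no positives
    recall = tp / (tp + fn)
    return precision, recall


# Toy data (hypothetical): 3 events among 8 cases, prevalence 0.375.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.5, 0.2, 0.1, 0.45, 0.05]

high = precision_recall_at(labels, scores, 0.7)  # strict cut-off
mid = precision_recall_at(labels, scores, 0.4)   # intermediate cut-off
low = precision_recall_at(labels, scores, 0.0)   # everything called positive
```

At the strict cut-off the precision is perfect but few events are found; lowering the cut-off raises recall while precision falls, and at the lowest cut-off precision equals the prevalence, which is the baseline of a PR curve.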
Finally, we noted that the performance of the traditional logistic regression was lower than that obtained with the RF model.

Clinical Considerations
Prediction models for incident AF have been employed to contribute to AF screening by determining a risk category for each patient [42]. In particular, the CHARGE-AF appeared most suitable for primary screening purposes [43]. Atrial fibrillation (AF) occurs in 20% to 40% of patients after CABG [18,[44][45][46][47][48]. Models have been developed to identify patients at high risk for the development of AF after CABG [49]. Nonetheless, classical statistical methods assume linear relationships between variables and predicted outcomes, which can lead to biased results.
In this project, we propose an artificial intelligence (AI)-based model to provide effective prediction of POAF following CABG and to highlight the most relevant patient and clinical features selected through ML approaches. This is a first attempt to identify the best predictor model for future clinical application in larger population cohorts. In addition, it is a first step toward implementing such a model for clinical inferences and designing a risk score to be used at the patient's bedside.
Age is a well-known independent predictor of POAF [50,51]. This can be explained by age-related structural changes, such as increased fibrosis and atrial dilatation [52], and by changes in the electrophysiological properties of the atria, which predispose to the development of AF [22]. Furthermore, related comorbidities in older patients may be responsible for the increased incidence of POAF in the elderly [53]. The importance of the Euro-Score as a predictor confirms this, reflecting the severe status of patients with associated cardiovascular and non-cardiovascular morbidities [54].
In our model, extracorporeal circulation time (ECC) and cardiopulmonary bypass time (CPB) are significantly related to POAF. CPB has been associated with ischemia-reperfusion injury, which induces a complex inflammatory response. Inflammatory changes have been reported in patients with AF, ranging from inflammatory infiltrates in atrial biopsies to increased concentrations of C-reactive protein, and these form the substrate for the generation of ectopic activity [55,56]. Nonetheless, it is still controversial whether CABG performed on the beating heart without ECC and CC reduces the incidence of POAF [54].
The mechanism by which low renal function influences POAF has yet to be fully understood. Nonetheless, the increased incidence of hypertension, fluid overload, and pathological activation of the intrarenal renin-angiotensin-aldosterone system might explain this association [57]. In addition, renal dysfunction was associated with both electrical and structural remodeling of the left atrium (LA), which might be the mechanism underlying the pathophysiology of new-onset POAF [58].
Finally, body surface area was an independent risk factor for new-onset AF, confirming previous reports [59,60]. Other studies have shown that BSA is only a risk factor for POAF in older patients [61]. Increased left atrium stretch, diastolic dysfunction [62], and high plasma volume secondary to obesity [63] have been proposed as mechanisms for the vulnerability of the left atrium to the development of POAF.

Limitations
The study presents some inherent limitations that need to be highlighted. First, the predictive models were not tested and validated on cohorts from other centers. Second, the hyperparameters search was limited by available hardware. Third, the small number of patients likely reduces the prediction capacity of the trained models. However, we preferred testing our models on actual clinical data accepting a limited cohort. Fourth, the initial data set might include only some AF predictive variables; more risk factors can lead to more precise models. Nonetheless, this was the first attempt toward an upcoming accurate score model based on ML. Finally, the relatively low level of ROC AUC obtained could be due to the low number of cases or the available variables' low prediction capacity. Alternatively, it could be due to the limited refinement of the hyperparameters.
However, this limitation is shared with many previously published papers employing actual data.
In addition, the lack of ECG data represents a further limitation. We are working on ECG-ML-reading procedures that will be the objective of a forthcoming paper. Moreover, external validation datasets would have been needed to prove the generalizability of the models.
Finally, we did not explore whether POAF persisted beyond the discharge from the hospital. A machine learning analysis of who persists in POAF after CABG, despite rhythm control, would be very interesting, and it is a call for further research.

Conclusions
Random Forest performed best in the clinical prediction of postoperative atrial fibrillation following coronary artery bypass grafting. The ML technique is promising for more sophisticated and accurate AI-based risk score models in this setting.
Further research employing other ML methods and more observations is warranted to yield more accurate ML predictive performance.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to GDPR privacy restrictions.